Abstract

Background: The frequent exchange of genetic material among prokaryotes means that extracting a majority 
or plurality phylogenetic signal from many gene families, and identifying gene families that are in 
significant conflict with the plurality signal, are frequent tasks in comparative genomics, and especially in 
phylogenomic analyses. Decomposition of gene trees into embedded quartets (unrooted trees each with four taxa) 
is a convenient and statistically powerful technique to address this challenging problem. This approach was 
shown to be useful in several studies of completely sequenced microbial genomes. 

Results: We present here a web server that takes a collection of gene phylogenies, decomposes them into 
quartets, generates a Quartet Spectrum, and draws a split network. Users are also provided with various data 
download options for further analyses. Each gene phylogeny is to be represented by an assessment of 
phylogenetic information content, such as sets of trees reconstructed from bootstrap replicates or sampled from 
a posterior distribution. The Quartet Decomposition server is accessible at http://quartets.uga.edu. 

Conclusions: The Quartet Decomposition server presented here provides a convenient means to perform Quartet 
Decomposition analyses and will empower users to find statistically supported phylogenetic conflicts. 



Background 

Sequence data revealed that genetic material in prokar- 
yotes (bacteria and archaea) can be transferred between 
divergent organisms [1] to an extent that makes it diffi- 
cult to reconstruct their evolutionary history [2-4]. Many 
microorganisms can take DNA directly from the envi- 
ronment; phages infect prokaryotic cells and may bring 
new DNA fragments into the host genomes; the conju- 
gation machinery allows for DNA exchange directly be- 
tween cells; and phage derived gene transfer agents [5] 
were suggested to transfer genetic material between related
and possibly unrelated organisms [6]. As a result of gene transfer,
genes found in the same genome can have different
phylogenies. The currently popular strategies for
inference of organismal relationships include (i) construc- 
tion of an organismal tree based on conserved genes 
presumed to be not transferred such as 16S ribosomal 
RNA and ribosomal proteins, or (ii) the assumption that
the plurality phylogenetic signal contained in all genes 
reflects the organismal history. The plurality signal is ei- 
ther extracted through joint analysis of several genes, 
usually after removing genes that show signs of having 
been horizontally transferred [7], or individual gene trees 
are combined using a variety of supertree approaches 
[8,9]. 

Phylogeny is typically represented as a tree, often with 
tens or hundreds of leaves. The large size and unequal 
number of taxa make comparisons between trees difficult.
A common approach is to compare all significantly
supported bipartitions. Lento plots allow visualizing the 
bipartitions supported by many gene families, and also 
depict, for each bipartition, all those bipartitions that are 
in conflict [10-12]. Besides requiring all phylogenies
to be the same size, i.e., all gene families to be represented in
all genomes analyzed, bipartition-based approaches suffer
from a loss of resolution as more sequences, and therefore
more tips and edges, are included. Quartet Decomposition
avoids both of these problems [13,14].

Quartets are unrooted trees consisting of four taxa
(Figure 1). A quartet is the minimal informative unit in a 
tree, and it has three possible topologies. An unrooted 
three-taxon tree unit only has one topology and thus is 
not informative, while a five-taxon tree unit has fifteen 
topologies, thus is too complicated; the four-taxon tree 
unit has a good balance between the amount of informa- 
tion it can carry and the complexity involved in analyz- 
ing it [15]. Quartet Decomposition is the analysis of 
quartets embedded in larger phylogenies. 

Support for bipartitions that include all taxa present in 
a phylogenetic tree can decrease, if one sequence in a 
larger phylogeny has low phylogenetic signal causing its 
position among bootstrap replicates to vary. In addition, 
as more taxa are added to an analysis, internal branches
become shorter and their support values lower.
This situation is unsatisfactory, because increased
taxon sampling is expected to increase the reliability of 
the phylogenetic reconstruction; however, the increase in 
reliability is not reflected in increased bipartition support 
values. To illustrate this paradox we performed simulations
summarized in Figure 2. Figure 2A shows how the
simulation is performed: starting from a tree with four
tips, we grow the tree by adding more tips to the internal
branch, then generate replicates and carry out
bipartition- and quartet-based analyses. Figure 2B shows that
even for sequences 1000 amino acids long, with 10 ad- 
ditional tips, the maximum support for a bipartition sep- 
arating AB from CD is less than 80% on average, and 
with 20 additional tips it is close to 60%, too low to pro- 
vide insight into any biological processes. In contrast, 
Figure 2C shows the ((A,B),(C,D)) embedded quartet is 
present in almost all replicates, demonstrating the near 
independence of sample size and embedded quartet 
resolution. 

The use of quartets has been explored in various 
phylogenetic applications. In 1996 K. Strimmer and A. 
von Haeseler developed the quartet puzzling algorithm 
for tree reconstruction [20]. Since then, the quartet-based
software TREE-PUZZLE [21] has been developed and
widely used for tree reconstruction from DNA and protein
sequences. Later, two software packages, Clann [22]
and QuartetSuite [23], were developed, allowing construction
of supertrees from multiple trees using quartets.
Zhaxybayeva and Gogarten [24] introduced the use
of embedded quartets to solve the taxon-sampling problem
usually associated with quartet-based analyses [25],
and used embedded Quartet Decomposition
to examine gene histories in cyanobacteria
and to identify horizontally transferred genes [13,14].
Boc et al. recently developed a Horizontal Gene Transfer
(HGT) detection algorithm that uses a quartet-based distance
as one of the criteria when reconciling gene and
organismal phylogenies [26]. Quartet analysis is also a
good choice for multi-locus sequence data analysis
[27], and has been used to infer taxonomic relationships
[28,29] as well as tree-like and net-like evolutionary processes
[30].

Figure 1 Quartet topologies. The three possible quartet
topologies for four taxa A, B, C and D.

To facilitate a wider application of Quartet Decompos- 
ition, we present a web-based platform for decomposing 
a given set of trees into quartets. The web server also 
provides several quartet-based analysis tools such as 
quartet spectrum generation, agreement score calcula- 
tion, and split network generation. Considering that a 
user may want to carry out additional analyses of the 
quartets, we also provide several options to download 
the computed quartets. 

Given a gene tree, our algorithm enumerates all possible
combinations M of any four out of x total taxa
under consideration,

$$M = \binom{x}{4} \qquad (1)$$

Let us use A, B, C and D to represent the four taxa in a
specific embedded quartet of the full phylogenetic tree.
In order to determine what specific topology the embedded
quartet has, we calculate pairwise distances $d_{AB}$,
$d_{AC}$, $d_{AD}$, $d_{BC}$, $d_{BD}$ and $d_{CD}$, where the distance $d_{XY}$ is
defined as the sum of all branch lengths in the given tree
from leaf X to leaf Y. If $(d_{AC} + d_{BD}) - (d_{AB} + d_{CD}) > 0$, the
quartet has topology TOP1 (Figure 1); if $(d_{AD} + d_{BC}) - (d_{AC} + d_{BD}) > 0$,
topology TOP2; and if $(d_{AB} + d_{CD}) - (d_{AD} + d_{BC}) > 0$,
topology TOP3. Each branch of the
embedded quartet may correspond to several internal
edges of the full phylogeny and has a length calculated as
exemplified for topology TOP1 (Figure 1): the length of
the internal branch is $d_{internal} = [(d_{AC} + d_{BD}) - (d_{AB} + d_{CD})]/2$,
and the length of the external branch of taxon A is
$d_A = [(d_{AC} + d_{AD}) - d_{CD}]/2 - d_{internal}$. The lengths of other
external branches are calculated similarly.
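
As an illustration, the distance test described above can be sketched in a few lines of Python. This is a minimal sketch and not the server's Perl implementation: the function names, the nested-dictionary layout of the distances, and the small tolerance `eps` are assumptions made for this example. The internal branch lengths returned for TOP2 and TOP3 are written by analogy with the TOP1 formula given in the text.

```python
def quartet_topology(dist, quartet, eps=1e-12):
    """Return (topology, internal_branch_length) for four leaves.

    dist[x][y] is the path length between leaves x and y in the full tree
    (hypothetical input layout chosen for this sketch).
    """
    a, b, c, d = quartet
    ab_cd = dist[a][b] + dist[c][d]
    ac_bd = dist[a][c] + dist[b][d]
    ad_bc = dist[a][d] + dist[b][c]
    if ac_bd - ab_cd > eps:          # first inequality in the text
        return "TOP1", (ac_bd - ab_cd) / 2
    if ad_bc - ac_bd > eps:          # second inequality
        return "TOP2", (ad_bc - ac_bd) / 2
    if ab_cd - ad_bc > eps:          # third inequality
        return "TOP3", (ab_cd - ad_bc) / 2
    return None, 0.0                 # unresolved (star-like) quartet


def external_length_a_top1(dist, quartet):
    """External branch of the first taxon, using the TOP1 formula from the text."""
    a, _, c, d = quartet
    topology, internal = quartet_topology(dist, quartet)
    if topology != "TOP1":
        raise ValueError("the formula shown in the text applies to TOP1 only")
    return (dist[a][c] + dist[a][d] - dist[c][d]) / 2 - internal


def symmetric(pairs):
    """Build dist[x][y] from a {(x, y): length} mapping."""
    dist = {}
    for (x, y), length in pairs.items():
        dist.setdefault(x, {})[y] = length
        dist.setdefault(y, {})[x] = length
    return dist


# Path lengths for an example tree ((A:1,B:2):0.5,(C:3,D:4)).
example = symmetric({
    ("A", "B"): 3.0, ("C", "D"): 7.0,
    ("A", "C"): 4.5, ("A", "D"): 5.5,
    ("B", "C"): 5.5, ("B", "D"): 6.5,
})
print(quartet_topology(example, ("A", "B", "C", "D")))
print(external_length_a_top1(example, ("A", "B", "C", "D")))
```

In this example the internal branch (0.5) and the external branch of A (1.0) are recovered from the pairwise path lengths alone, which is why the quartet can be read off a much larger tree without pruning it first.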

Implementation 

The server is implemented on a computer running 
Linux RedHat Enterprise 5.0 operating system. Apache 
2.2.9 is used as the web server, and PHP 5.2.6 is used to 
develop dynamic webpages. Scripts implementing the 
server functions are written in Perl. The BioPerl 1.60 
[31] TreeIO module is used to help compute the
decomposition of an input tree, and the Perl graphic library
GD is used to draw the quartet spectrum. SplitsTree4
[32] is used to generate the split network. A Linux
computer cluster with 8 nodes, which can support 32
simultaneous jobs, is used as the backend for tree decomposition
calculation. The Sun Grid Engine 6.2 is
used for job management.

Figure 2 Comparison of the performance of bipartition- and quartet-based analyses. Increasing taxon sampling justifiably is expected to
increase the reliability of phylogenetic reconstruction; however, the support values for bipartitions that include all taxa tend to drop as more
taxa are added. Panel A depicts the phylogenies used for simulations. Starting with an unrooted tree of four leaves, ((A,B),(C,D)) and an internal
branch of 0.01 average substitutions per site, we added 1, 4, 9, 19 and 49 additional leaves to the internal branch. Simulations for each topology
were performed with Seq-Gen [16] using the indicated trees, the WAG substitution matrix [17] and a Γ distribution with a shape parameter of
1 approximated by four discrete rate categories for the rate distribution. SEQBOOT from the PHYLIP package [18] was used to generate
100 bootstrap sequences and trees were reconstructed from each bootstrap sample using FastTree 2.1 [19] using the same model for sequence
evolution and parameters "-spr 4", "-mlacc 2", and "-slownni" for increased reconstruction accuracy. For each topology the evolution of sequences
of varying lengths (200, 500 and 1000 amino acids) was simulated. For each of the simulated data sets, we generated 100 bootstrap replicates
and recorded the maximum support for a bipartition separating (A,B) from (C,D) (Panel B) and the bootstrap support for the embedded quartet
((A,B),(C,D)) for all simulations (Panel C). Error bars give the standard error of the mean from 100 replicates each.

The overall structure of the server is illustrated in 
Figure 3. A user needs to prepare two input files: one 
containing the names of the genomes or taxa under 
consideration, the other is a compressed file of all gene 
trees (currently the server will accept .tar.gz, .rar and .zip 
files). Each gene tree is represented by multiple trees that 
assess phylogenetic information content, such as sets of 
trees reconstructed from bootstrap replicates or sampled 
from a posterior distribution. We also provide an interface
for users to generate bootstrap replicates from a multiple
sequence alignment. The replicates are generated
by a BioPerl utility function, and the trees are generated 
by FastTree 2.1 [19]. Since we are comparing quartets 
across gene families to obtain a plurality signal, taxa 
labels corresponding to genes in the same organism are 
expected to have the same name. To facilitate the replacement
of gene identifiers with the names of the genomes,
we provide Perl scripts (see FAQ section on the
server) for conversion and consistency checks. These
scripts require BioPerl 1.60 or a newer version on the
user's computer. If the user does not have BioPerl installed
on their local computer, we also provide a web
interface for the user to do the name conversion on the
server.

Figure 3 Data flow of quartet decomposition analyses using
the QD server. Steps labeled a-g are described in detail in the
Results and Discussion section. The boxes outlined with green
border are parameterized filters, which can be applied multiple
times to generate different quartet spectra. The green arrows
represent the repeatable steps.

After the name conversion, the user can upload the 
files to the server, specify the parameter values (or just 
use the default parameter values given by the server), 
and start the decomposition calculation. The computa- 
tion may take several hours depending on the number of 
taxa, the number of gene families and the number of 
trees per gene family. For example, when we provided 
trees from 100 bootstrap samples for each gene family, it 
took 2 hours and 10 minutes for a job with 1128 gene 
families from 10 genomes, and 15 hours and 21 minutes 
for a job with 1734 gene families from 19 genomes. The 
run time is heavily dependent on the number of gen- 
omes since the number of quartets is a fourth degree 
polynomial of this number. Due to the limitations of 
computer hardware housing the server at the time of 
writing (May 2012), we suggest that the user not submit a
job with more than 20 genomes. However, the server 
will accept a job with up to 100 genomes, issuing a 
warning for a job with more than 40 genomes. The user 
can refresh the job status page while the job is running: 
the server will display the currently analyzed gene fam- 
ily. The server will send an email to the user with a link 
to the status page once the job is submitted; and it will 
send another email after the job is completed. After the 
decomposition is done, a quartet spectrum [14] (see next 
section for its description) will be generated, and the 
user can run various analyses using tools provided by the
server, such as filtering quartets, calculating an agree- 
ment score, downloading a specified subset of the de- 
composed quartets, and generating a splits network. 
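
To give a feeling for why the run time grows so quickly with the number of genomes, the number of four-taxon combinations can be tabulated directly. The short sketch below only counts quartets per tree; the helper name `quartet_count` is an assumption for this example, and actual run times also depend on the number of gene families and trees per family, as noted above.

```python
from math import comb


def quartet_count(n_genomes):
    """Number of distinct four-taxon sets among n_genomes taxa (Equation 1)."""
    return comb(n_genomes, 4)


# Genome counts mentioned in the text: two benchmark jobs, the suggested
# limit, the warning level, and the hard limit of the server.
for n in (10, 19, 20, 40, 100):
    print(n, quartet_count(n))
```

Going from 10 to 19 genomes raises the count from 210 to 3876 quartets per tree, and a job with 100 genomes involves 3921225 quartets per tree.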

Results and discussion 

The server provides a platform for performing the fol- 
lowing quartet-based analyses. 

Quartet spectrum generation 

Quartet Decomposition of a gene tree is the process of 
finding all possible embedded quartet topologies for a 
given tree. For a given list of genomes and multiple gene 
families collected from these genomes, the quartet topologies
in each gene family are identified and then
summarized for the set of taxa in a quartet spectrum. The
calculation consists of the following steps (the user needs 
to perform steps a-c, the server performs steps d-g): 

a. For a set of genomes of interest, assemble and align 
gene families, and obtain trees either from bootstrap 
replicates or from a posterior distribution. 
b. Prepare trees in Newick format for each gene 
family. Put all trees for the same gene family into 
one file. Compress all tree files into a single file. 

c. Upload genome list and the compressed tree file 
to the server. Specify the parameters for filters 
(see below). Start the job. 

d. Decompose each tree into embedded quartets. 

e. For each gene family, calculate the support value 
for the three topologies of each quartet by counting 
the fraction of the bootstrap trees that contain this 
quartet topology. In case of 100 replicate trees, 
each embedded quartet in a family has a dominant 
topology with a maximum score of 100. Comparable 
scores for the alternative quartet topologies, such as 
34, 33, 33, are indicative of no or little phylogenetic 
signal for that embedded quartet in a particular 
gene phylogeny. 

f. For each quartet, determine the plurality topology 
across all gene families as follows: given a threshold 
for a support value cut-off to determine whether 
the dominant topology is supported (85%, 90% and 
95% are currently supported by the server), count 
the number of gene families supporting each of 
the three topologies. The topology with the highest 
number of supporting gene families is considered 
the plurality topology of the quartet among all the 
analyzed gene families. 

g. Sort the quartets by the number of gene families 
supporting the plurality topologies, and plot as a 
histogram with these sorted numbers along with the 
labels of the associated quartets. Analogous to the 
Lento plot [10], another histogram on the negative 
side of the Y-axis is also added to show the sum
of the two non-plurality topologies (conflicting
topologies) for each quartet. The resulting diagram 
is called the quartet spectrum (Figure 4). 
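
Steps e to g can be sketched as follows for a single quartet. This is an illustrative sketch under stated assumptions, not the server's code: the function names are hypothetical, support values are expressed as fractions rather than percentages, the cut-off is applied as "at least the threshold", and ties between topologies are broken in favor of the first one listed.

```python
from collections import Counter

TOPOLOGIES = ("TOP1", "TOP2", "TOP3")


def support_values(replicate_topologies):
    """Step e: fraction of replicate trees showing each topology (one gene family)."""
    n = len(replicate_topologies)
    counts = Counter(replicate_topologies)
    return {top: counts[top] / n for top in TOPOLOGIES}


def plurality_topology(family_supports, threshold=0.90):
    """Step f: count gene families whose dominant topology passes the cut-off."""
    votes = Counter()
    for support in family_supports:
        best = max(TOPOLOGIES, key=lambda top: support[top])
        if support[best] >= threshold:
            votes[best] += 1
    if not votes:
        return None, votes
    return max(TOPOLOGIES, key=lambda top: votes[top]), votes


def spectrum_bar(family_supports, threshold=0.90):
    """Step g: positive and negative bar heights for one quartet."""
    plurality, votes = plurality_topology(family_supports, threshold)
    supporting = votes[plurality] if plurality else 0
    conflicting = sum(votes.values()) - supporting
    return plurality, supporting, -conflicting


# Four hypothetical gene families with 100 replicate trees each.
families = [
    support_values(["TOP1"] * 95 + ["TOP2"] * 5),
    support_values(["TOP1"] * 100),
    support_values(["TOP2"] * 92 + ["TOP3"] * 8),
    support_values(["TOP1"] * 34 + ["TOP2"] * 33 + ["TOP3"] * 33),  # no signal
]
print(spectrum_bar(families, threshold=0.90))
```

Sorting the quartets by the second element of the returned tuple, in descending order, gives the left-to-right order of the bars in the spectrum.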

The quartet spectrum provided by the server is inter- 
active: when a user clicks on the bar representing a 
specific quartet, a new page pops up with the detailed
information for that quartet, including its support value 
in each gene family. 

Sometimes a user may prefer to compare the individ- 
ual gene phylogenies against another tree obtained from 
other sources, such as phylogenies calculated from ribosomal
components [33], the Tree of Life Project
(http://tolweb.org/tree/), or the NCBI taxonomy database [34]
(http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi).
The server can compare the quartets in the gene
families against the quartet topologies embedded in the 
reference tree and generate a quartet spectrum counting 
the quartet topologies in the reference tree as positive. 
Large values in the negative part of the spectrum would 
indicate specific conflicts between gene phylogenies and
the reference tree. The presence of at least one embedded
quartet with a bootstrap support value greater than
80 in conflict with a reference phylogeny reveals a significant
phylogenetic conflict suggestive of an HGT event.
Depending on the data analyzed, alternative explanations 
for phylogenetic conflict may need to be considered. 
Lineage sorting occurs in taxa with large populations 
and a rapid succession of speciation events; unrec- 
ognized paralogy always is an alternative explanation to 
HGT [35] and needs to be considered when independent 
and parallel gene loss cannot be excluded because only 
few lineages are analyzed. While the rate of false posi- 
tives is reasonably assessed through the bootstrap sup- 
port values [14,36], the rate of false negatives likely is 
large, especially for transfers between close relatives [37]. 

Processing of paralogs 

If there are paralogs in a gene family (and hence mul- 
tiple homologs per gene family have the same label), the 
distribution of quartet topologies will be calculated as 
follows. Given a tree and four genomes A, B, C and D,
let the numbers of paralogs be a, b, c and d for each genome,
respectively. The total number of quartet topologies
with the four genomes will be $t = a \times b \times c \times d$. Since
each topology will represent one of TOP1, TOP2 or
TOP3 (see Figure 1), we can count the total number of
quartet topologies with TOP1, TOP2 and TOP3 as $t_1$, $t_2$
and $t_3$. The sum of $t_1$, $t_2$ and $t_3$ is equal to t. For the
given tree, we calculate the ratio of TOP1, TOP2 and
TOP3 as $t_1/t$, $t_2/t$ and $t_3/t$, respectively. The sum of the
three ratios will be equal to 1, which is the same for a
tree without paralogs. In addition, quartets with two tips
from the same genome (i.e., paralogs) will be ignored. If
gene families with paralogs are included in a quartet de- 
composition analysis, conflicting quartets may reflect the 
gene duplication events, and can no longer be identified 
with gene transfer events. However, families with para- 
logs are useful to extract the plurality phylogenetic signal 
contained in a set of genomes. 
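
The counting scheme for paralogs can be illustrated with a short sketch. The function name and its arguments are hypothetical; `classify` stands for any routine that assigns one of the three topologies to four specific leaves of a tree, and the sketch assumes that it always returns TOP1, TOP2 or TOP3.

```python
from collections import Counter
from itertools import product


def paralog_topology_ratios(copies_a, copies_b, copies_c, copies_d, classify):
    """Return t1/t, t2/t and t3/t for one tree and four genomes.

    copies_x lists the leaves (paralogs) that belong to genome x;
    classify(a, b, c, d) returns "TOP1", "TOP2" or "TOP3" for four leaves.
    """
    t = len(copies_a) * len(copies_b) * len(copies_c) * len(copies_d)
    if t == 0:
        raise ValueError("each genome needs at least one leaf in the tree")
    counts = Counter(
        classify(a, b, c, d)
        for a, b, c, d in product(copies_a, copies_b, copies_c, copies_d)
    )
    return {top: counts[top] / t for top in ("TOP1", "TOP2", "TOP3")}


# Toy classifier standing in for a real tree: copy A1 groups with B, copy A2 with C.
def toy_classify(a, b, c, d):
    return "TOP1" if a == "A1" else "TOP2"


ratios = paralog_topology_ratios(["A1", "A2"], ["B1"], ["C1", "C2", "C3"], ["D1"],
                                 toy_classify)
print(ratios)
```

Here t = 2 x 1 x 3 x 1 = 6, and because each leaf combination takes one leaf from each of the four genomes, quartets with two tips from the same genome never arise.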

Figure 4 An example of a quartet spectrum. The x-axis
represents the quartets, one per column, arranged in descending
order of the number of gene families supporting the plurality/
reference topology of each quartet. For each column, the y-axis
represents the number of gene families in which that quartet
supports (positive y values, one topology) or conflicts (negative y
values, the other two possible topologies) with the plurality or
reference topology. For the conflicts, the y value represents a sum of
gene families supporting the two other topologies. The spectrum is
color-coded according to different bootstrap support thresholds
used.

Agreement score calculation

For each gene family we also calculate an agreement
score [13], which measures how well the gene family
agrees with the plurality or the reference tree:

$$S = \frac{1}{N \times M} \sum_{i=1}^{M} n_i$$

where N is the number of trees for this gene family; M is
the number of possible quartets; and $n_i$ is the number of
trees whose topology for the i-th quartet agrees with the
plurality (or reference). The score S is equal to 1 if all the trees
have the same topology which is also identical to the
reference, and it is less than 1 otherwise. The more conflicts
between the gene trees and the reference are observed,
the closer the score is to 0.
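
A minimal sketch of the agreement score follows. It assumes that S is the agreeing count summed over all M quartets and divided by N x M, which is the form suggested by the definitions of N, M and the per-quartet agreement counts and by the stated property that S equals 1 under complete agreement. The function name and input layout are hypothetical.

```python
def agreement_score(agree_counts, n_trees):
    """Agreement score S for one gene family.

    agree_counts[i] is the number of trees (out of n_trees) in which the
    i-th quartet has the plurality or reference topology.
    """
    m = len(agree_counts)
    if m == 0 or n_trees <= 0:
        raise ValueError("need at least one quartet and one tree")
    return sum(agree_counts) / (n_trees * m)


# Three quartets scored over 100 bootstrap trees.
print(agreement_score([100, 100, 100], 100))  # complete agreement
print(agreement_score([100, 50, 0], 100))     # partial conflict
```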

Filters 

The inaccuracies in phylogenetic reconstruction may 
introduce noise and misleading information to quartet 
analysis. To minimize their impact, we designed three 
filters to remove such quartets, categorized as follows. 

Long external branch(es)

Each quartet has four external branches and one internal 
branch (Figure 1). Long external branches may lead to 
the so-called long branch attraction artifact [38], which
may erroneously lead to the conclusion that two rapidly 
evolving lineages are closely related. A filter is imple- 
mented to remove quartets with long external branches 
according to the following criterion: if the ratio between 
the longest external branch and the internal branch is 
larger than a pre-set threshold (default value is 10), the
quartet will be removed.

Short internal branch 

If a quartet has a very short internal branch, there may 
not be enough phylogenetic information to resolve the 
topology correctly. The server provides an option to remove
a quartet if its internal branch is shorter than a
pre-set threshold (default value is 0.02 substitutions per
site). If the branch length in the tree is not measured in
substitutions per site, 0.02 may not be an appropriate
value, and the user must choose a suitable value
instead.
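
The two branch-length filters can be combined in a single check, sketched below. The function name and argument layout are assumptions for this example; the default values are those given in the text, and a quartet with a zero-length internal branch is rejected because the ratio is undefined.

```python
def passes_branch_filters(external, internal, max_ratio=10.0, min_internal=0.02):
    """Return True if a quartet survives both branch-length filters.

    external: the four external branch lengths of the embedded quartet
    internal: the internal branch length (same units as the tree)
    """
    if internal < min_internal or internal <= 0:
        return False                      # short internal branch
    return max(external) / internal <= max_ratio  # long external branch


print(passes_branch_filters((0.10, 0.20, 0.15, 0.30), 0.05))  # ratio 6: kept
print(passes_branch_filters((0.10, 0.20, 0.15, 0.60), 0.05))  # ratio 12: removed
print(passes_branch_filters((0.10, 0.20, 0.15, 0.30), 0.01))  # internal too short
```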

Less supported quartets 

Quartets that due to a lack of phylogenetic signal are 
poorly resolved in most gene families could result in er- 
roneous but significant conflicts with the plurality (false 
positives) [14]. To remove quartets that are not resolved 
by most gene families, we implemented the following filter,
defined by two thresholds, $T_1$ (ranges between 0%
and 100%) and $T_2$ (a positive integer). For a specific
quartet, if the number of gene families supporting
it with a support value of at least $T_1$ is less than $T_2$, this
quartet will be removed from a quartet spectrum. This
filter is applied after the decomposition process is done, 
and the effect of different filter settings on the quartet 
spectrum can be explored. In contrast, the other two fil- 
ters have to be specified before the decomposition pro- 
cess starts. 
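
A possible reading of this filter is sketched below. It assumes that the second threshold is a count of gene families (it is described as a positive integer) and that the support value compared against the first threshold is the support of the dominant topology of the quartet in each gene family, expressed as a fraction. The function name and defaults are hypothetical.

```python
def keep_quartet(dominant_supports, t1=0.90, t2=1):
    """Return True if at least t2 gene families resolve the quartet.

    dominant_supports: for each gene family, the support (0 to 1) of the
    best-supported topology of this quartet.
    """
    resolved = sum(1 for support in dominant_supports if support >= t1)
    return resolved >= t2


supports = [0.95, 0.40, 0.91, 0.50]
print(keep_quartet(supports, t1=0.90, t2=2))  # two families pass: kept
print(keep_quartet(supports, t1=0.90, t2=3))  # fewer than three pass: removed
```

Because the check only needs the stored per-family support values, it can be re-run with different settings without repeating the decomposition, which matches the description of this filter as being applied after the decomposition is done.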

Splits network generation 

A splits network is a network representation of the rela- 
tionship of a set of taxa [39], in which multiple alterna- 
tive splits (and not just the most supported one) are 
depicted. In situations with frequent exchanges of gen- 
etic material, a split network is a better representation 
for the taxa relationship than a tree. Our server can 
convert any quartet subset (see next section for a de- 
scription of quartet sets) to a matrix [40,41], and then 
generate a split network by using the SplitsTree4 program
[32].

Quartet download 

Although we have provided a number of quartet analysis 
tools through the server, a user may want to perform 
their own analyses on the computed quartets. We offer
two options to download the decomposed quartets.

The first option is to download a subset of the quar- 
tets that are supported with a support value of at least 
$T_1$ in at least $T_2$ gene families (see section on filters for
descriptions of $T_1$ and $T_2$). The second option is based
on the quartet spectrum. The quartet topologies in agreement
with the plurality are considered as plurality quartet
topologies, and as conflicting quartet topologies otherwise.
The user can obtain the subsets of plurality or conflicting
quartet topologies using thresholds $T_1$ and $T_2$ as
described above. 

Examples 

We provide two quartet decomposition examples, which 
can be accessed from the Frequently Asked Questions 
section on the quartet server web page. Both the data
sets and the quartet spectrum are available on the ser- 
ver. The user can run the job by using the data sets, or 
go directly to the quartet spectrum and explore other 
analyses on the server. 

One data set consists of 1,128 gene families present in 
at least 9 of 11 selected cyanobacterial genomes [14]. 
Quartet Decomposition of these families revealed that 
cyanobacterial evolution is incompatible with a strictly bifurcating
tree and helped to pinpoint specific cases of
horizontal gene transfer. 

The other data set consists of 1,812 gene families 
present in at least 4 of 18 specific cyanobacterial genomes 
of Prochlorococcus marinus and marine Synechococcus 
spp. [13]. Quartet Decomposition identified 495 gene 
families that did not separate genera Prochlorococcus 
and Synechococcus as expected. This observation can be 
explained by the existence of a "highway of gene sharing" 
between marine Synechococcus spp. and low-light adapted 
Prochlorococcus spp. (see [13] for additional discussion). 

In both studies, Quartet Decomposition has proven
to be an invaluable tool for identification of phylogenetic 
signal shared by genes in analyzed genomes and for dis- 
covery of horizontally transferred genes. 

Conclusion 

The Quartet Decomposition server presented here pro- 
vides an interactive interface to dissect complex evolu- 
tionary histories of microbial genomes. We believe that 
this online service will be a valuable tool for the com- 
parative genomics community. 

Availability and requirements 

Project name: Quartet Decomposition server. 
Project home page: http://quartets.uga.edu. 
Operating system(s): Platform independent 
Other requirements: The server has been tested using 
Firefox (Windows, Linux and Mac OS X), Internet Explorer
(Windows), Safari (Mac OS X Lion), and Google
Chrome (Windows and Linux) browsers. 